2019 Crowdsourcing and Computer Vision class final project @ UT Austin iSchool
Image captions are a primary instrument for helping blind and visually impaired people understand images and their surrounding environments. Image captioning in computer vision has generally depended on captions annotated by crowd workers, which can vary considerably from one another.
This paper addresses that variance by using a crowdsourcing platform to prioritize the key elements in such captions, in order to explore what information needs to be included to generate satisfying and precise image captions.
We screened 100 images whose annotated captions differed from one another and asked Amazon Mechanical Turk workers to highlight and rank the 5 elements they regarded as most important after reviewing each image. We analyzed the collected data by word class and content denotation, paying special attention to whether subjective descriptions were highlighted and whether annotations differed between workers who had been informed of the intended application and purpose of this study and those who had not. A minimal sketch of the word-class analysis is given below.
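The following is a minimal sketch of how highlighted caption elements could be grouped by word class. The original does not specify the tooling; the use of NLTK, the example phrases, and the coarse tag mapping are assumptions for illustration only.

```python
# Sketch: group worker-highlighted caption elements by coarse word class.
# Assumes NLTK is installed and its tokenizer/tagger models are downloaded:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import Counter
import nltk

# Hypothetical sample of elements highlighted by workers (not real study data).
highlighted_elements = ["red umbrella", "smiling", "walks", "crowded street", "beautiful"]

def coarse_word_class(penn_tag):
    """Map a Penn Treebank POS tag to a coarse word class."""
    if penn_tag.startswith("NN"):
        return "noun"
    if penn_tag.startswith("VB"):
        return "verb"
    if penn_tag.startswith("JJ"):
        return "adjective"
    if penn_tag.startswith("RB"):
        return "adverb"
    return "other"

counts = Counter()
for element in highlighted_elements:
    tokens = nltk.word_tokenize(element)
    for _, tag in nltk.pos_tag(tokens):
        counts[coarse_word_class(tag)] += 1

print(counts)  # e.g. Counter({'noun': 3, 'adjective': 3, 'verb': 2})
```

Counts like these could then be compared between the informed and uninformed worker groups, for instance to see whether adjectives (often subjective descriptions) are highlighted more often by one group.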
This paper aims to provide a sample dataset of key elements in image captions, shed light on what information deserves attention in object recognition and detection, and offer thoughts on whether the intended use of image captions should change the content they include.